The original formulation of gradient descent just takes a step in parameter space in the direction opposite the gradient, i.e., down the steepest slope: θ ← θ − η ∇L(θ), where L is the error we are minimizing and η is the learning rate.
This strategy forces us to choose a single learning rate that has to serve for the entire run: large enough to make progress early on, yet small enough to settle into an optimum at the end.
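As a concrete reference point, here is a minimal sketch of that update in Python, assuming a toy quadratic loss and a hand-picked fixed learning rate (both are illustrative assumptions, not anything prescribed above):

```python
import numpy as np

def grad(theta):
    # Gradient of a toy quadratic loss L(theta) = ||theta - 3||^2 / 2.
    return theta - 3.0

theta = np.zeros(2)   # starting point
eta = 0.1             # the single, fixed learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)   # step against the gradient
print(theta)          # approaches [3, 3]
```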
Some things we can do without trying a new algorithm (short sketches of each follow the list, in the same order):

- Learning rate decay: We can improve this situation by allowing the learning rate to decrease over time. This gives us the benefit of fast learning early on, while making it easier to converge as we get close to an optimum.
- Stochastic gradient noise: We can also throw a noise term into the update, to help us avoid getting stuck at local optima.
- Random restarts: We can run the process from multiple random starting points to increase our chance of finding a global optimum.
- Early stopping: We can stop the process when the rate of change in the error falls below some threshold.
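For the learning-rate idea, a minimal sketch is to put the step size on a decreasing schedule; the 1/t schedule and its constants below are arbitrary choices for illustration:

```python
import numpy as np

def grad(theta):
    # Same toy quadratic loss as before: L(theta) = ||theta - 3||^2 / 2.
    return theta - 3.0

theta = np.zeros(2)
eta0 = 0.5                          # large initial rate for fast early progress
for t in range(1, 201):
    eta = eta0 / (1.0 + 0.05 * t)   # learning rate decays toward zero
    theta = theta - eta * grad(theta)
print(theta)
```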
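For the noise term, one way to sketch it is to add zero-mean Gaussian noise to every step and shrink the noise over time so the iterate can still settle down; the Gaussian choice and the 1/t shrinkage are assumptions, not the only option:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    return theta - 3.0               # toy quadratic loss again

theta = np.zeros(2)
eta = 0.1
for t in range(1, 501):
    noise = rng.normal(scale=1.0 / t, size=theta.shape)   # shrinking noise
    theta = theta - eta * grad(theta) + noise             # noisy step
print(theta)
```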
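Random restarts just wrap the descent in an outer loop over random initial points and keep the best result. In the sketch below, the bumpy toy loss, the number of restarts, and the sampling box are all arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # A toy loss with several local minima.
    return np.sum(theta**2 + 2.0 * np.sin(5.0 * theta))

def grad(theta):
    return 2.0 * theta + 10.0 * np.cos(5.0 * theta)

best_theta, best_loss = None, np.inf
for _ in range(20):                            # 20 random restarts
    theta = rng.uniform(-3.0, 3.0, size=2)     # random starting point
    for _ in range(500):
        theta = theta - 0.01 * grad(theta)
    if loss(theta) < best_loss:                # keep the best optimum found
        best_theta, best_loss = theta, loss(theta)
print(best_theta, best_loss)
```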
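Early stopping, as described here, is just a check on how much the error changed between iterations; the tolerance below is an arbitrary threshold:

```python
import numpy as np

def loss(theta):
    return 0.5 * np.sum((theta - 3.0) ** 2)   # toy quadratic loss

def grad(theta):
    return theta - 3.0

theta = np.zeros(2)
eta, tol = 0.1, 1e-8
prev = loss(theta)
for t in range(10000):
    theta = theta - eta * grad(theta)
    cur = loss(theta)
    if abs(prev - cur) < tol:        # error barely changed: stop
        print(f"stopped after {t + 1} steps")
        break
    prev = cur
print(theta)
```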